Summary. In this paper, we present a method to optimise rough set partition
sizes, after which rule extraction is performed on HIV data. A genetic algorithm
is used to determine the partition sizes of a rough set in order to maximise the
rough set's prediction accuracy. The proposed method is tested on a set of
demographic properties of individuals obtained from the South African antenatal
survey. Six demographic variables are used in the analysis: race, age of mother,
education, gravidity, parity, and age of father, with the outcome or decision
being either HIV positive or negative. Rough set theory is chosen because the
extracted rules are easy to interpret. The prediction accuracy of equal width
bin partitioning is 57.7%, while the accuracy achieved after optimising the
partitions is 72.8%. Several other methods have been used to analyse the HIV
data, and their results are stated and compared to those of rough set theory
(RST).
1 Introduction



In the last 20 years, over 60 million people have been infected with HIV (Human
immunodeficiency virus), and of those cases, 95% are in developing countries.
In 2006 alone, an estimated 39.5 million people around the world were living
with HIV, with 27.5 million of those people living in Sub-Saharan Africa.
During this year, AIDS (Acquired Immune Deficiency Syndrome) claimed an
estimated 2.9 million lives [2]. HIV has been identified as the cause of AIDS.
The effect of AIDS is not only detrimental to the infected individual but also
has a devastating effect on the economic, social, security and demographic
levels of a country. Because AIDS is killing people in the prime of their
working and parenting lives, it represents a grave threat to economic
development. In the worst affected countries, the epidemic has already reversed
many of the development achievements of the past generation [2]. Besides its
negative economic effects, AIDS has a large negative impact on the social and
security levels of a country. Social levels drop as the health and educational
development that is supposed to benefit poor people is impeded and the average
life expectancy drops. It is estimated that by 2010 the number of orphans will
have doubled from that in 2006 [J].

Early studies on HIV/AIDS focused on individual characteristics and behaviours
in determining HIV risk; Fee and Krieger refer to this as biomedical
individualism [3]. However, it has been determined that the study of the
distribution of health outcomes and their social determinants is of more
importance; this is referred to as social epidemiology. This study uses
individual characteristics as well as social and demographic factors in
determining the risk of HIV.

It is thus evident from the above that the analysis of HIV is of the utmost
importance. By correctly forecasting HIV, the causal interpretation of a
patient being seropositive (infected by HIV) is made much easier. Previously,
computational intelligence techniques have been used extensively to analyse
HIV. Leke et al. have used autoencoder network classifiers, inverse neural
networks, as well as conventional feedforward neural networks to analyse HIV
El IZ]. They used the inverse neural network for adaptive control of HIV status
to understand how the demographic factors affect the risk of HIV infection [7].

Although an accuracy of 92% is achieved when using the autoencoder method [5],
it is disadvantageous due to its "black box" nature; this also applies to the
other neural network techniques mentioned. Neural network connection weights
and transfer functions are frozen upon completion of training [8]. Neural
networks favour accuracy over analysis of the data, but in the case of
analysing HIV data, it can be argued that interpretability of the data is of
more importance than prediction alone. For this reason, rough set theory (RST)
is proposed to forecast and interpret the causal effects of HIV.

Rough sets have been used in various biomedical applications [9] [10] [11].
Other applications of RST include the prediction of aircraft component failure,
fault diagnosis and stock market analysis [12] [13] [14]. In most applications,
however, RST is used primarily for prediction. Rowland et al. compared the use
of RST and neural networks for the prediction of ambulation following spinal
cord injury [15], and although the neural network method produced more accurate
results, its "black box" nature makes it impractical for rule extraction
problems.

Poundstone et al. related demographic properties to the spread of HIV. In
their work they justified the use of demographic properties to create a model
to predict HIV from a given database, as is done in this study. RST uses the
social and demographic factors to predict HIV status, which in turn provides
insight into which variables are most sensitive in determining HIV status. For
example, if 90% of HIV positive cases have limited or no education, whereas
85% of HIV negative cases have at least secondary school education, this would
indicate that improving the nation's education should decrease the percentage
of seropositive patients.

In order to achieve the best accuracy, the rough set partitions, or
discretisation process, need to be optimised. The optimisation is done by a
genetic algorithm (GA), where the fitness function aims to achieve the highest
accuracy produced by the rough set. Literature reviews have shown that limited
work has been done on the optimisation of rough set partition sizes.

The background of the topic is stated in Section 2. A discussion of rough set
theory and the formulation of the rough sets from which rules are extracted is
given in Section 3. Section 4 explains how the genetic algorithm is used to
optimise the rough set partitions, and in Section 5 the results obtained when
partitioning the data using equal width bins are compared to the results
obtained when optimising the partition sizes using a GA.

2 Background 

Rough set theory was introduced by Zdzislaw Pawlak in the early 1980s [16].
RST is a mathematical tool which deals with vagueness and uncertainty. It is
of fundamental importance to artificial intelligence (AI) and cognitive science
and is highly applicable to this study, which performs the tasks of machine
learning and decision analysis. Rough sets are useful in the analysis of
decisions in which there are inconsistencies. To cope with these
inconsistencies, lower and upper approximations of decision classes are
defined [17]. Rough set theory is often presented as competing with fuzzy set
theory (FST), but it in fact complements it [TO. One of the advantages of RST
is that it does not require a priori knowledge about the data set, and it is
for this reason that statistical methods are not sufficient for determining the
relationship between the demographic variables and their respective outcomes.

The data set used in this paper was obtained from the South African antenatal
sero-prevalence survey of 2001. The data was obtained through questionnaires
completed by pregnant women attending selected public clinics, and the survey
was conducted concurrently across all nine provinces in South Africa. The
sentinel population for the study only included pregnant women attending an
antenatal clinic for the first time during their current pregnancy. The choice
of the first antenatal visit is made to minimise the chance of one woman
attending two clinics and being included in the study more than once [18].

The six demographic variables considered are: race, age of mother, education,
gravidity, parity, and age of father, with the outcome or decision being
either HIV positive or negative.

The HIV status is the decision, represented in binary form as either a 0 or a
1, with a 0 representing HIV negative and a 1 representing HIV positive. The
input data was discretised into four partitions. This number was chosen as it
gave a good balance between computational efficiency and accuracy. The race
attribute is presented on a scale of 1 to 4, where the numbers represent White,
African, Coloured and Asian respectively. The parents' ages are given and
discretised accordingly. Education is given as an integer, where 13 is the
highest level of education, indicating tertiary education. Gravidity is defined
as the number of times that a woman has been pregnant, whereas parity is
defined as the number of times that she has given birth. It must be noted that
multiple births during a pregnancy are indicated with a parity of one.
Gravidity and parity also provide a good indication of the reproductive health
of pregnant women in South Africa.
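
As an illustration of the equal width bin discretisation into four partitions,
a minimal Python sketch is given here. The function names and the age range of
15 to 55 are hypothetical and chosen only for the example; the survey's actual
attribute ranges are not stated in this paper.

```python
def equal_width_edges(lo, hi, n_bins=4):
    """Return the n_bins - 1 interior cut points of equal width bins on [lo, hi]."""
    width = (hi - lo) / n_bins
    return [lo + k * width for k in range(1, n_bins)]


def discretise(value, edges):
    """Map a value to a partition index 0..len(edges), given sorted cut points."""
    return sum(value >= e for e in edges)


# Hypothetical range for mother's age (an assumption, not taken from the survey).
age_edges = equal_width_edges(15, 55)
partition = discretise(32, age_edges)  # partition index for a 32-year-old mother
```

With four partitions there are three interior cut points per attribute; it is
these cut points that an optimiser can move instead of keeping them equally
spaced.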

3 Rough Set Theory and Rough Set Formulation 

Rough set theory deals with the approximation of sets that are difficult to
describe with the available information [10]. It deals predominantly with the
classification of imprecise, uncertain or incomplete information. Some concepts
that are fundamental to RST are given below.

3.1 Information Table 

The data is represented using an information table, an example for the HIV 
data set for the ith object is given below: 



Table 1: Information Table of the HIV Data.

          Race  Mothers Age  Education  Gravidity  Parity  Fathers Age  HIV Status
Obj(1)    2     32           13         1          1       22           1
Obj(2)    3     22           5          2          1       25           1
Obj(3)    1     35           6          1          0       33           0
Obj(i)    2     27           9          3          2       30           0


In the information table, each row represents a new case (or object). Besides
HIV Status, each of the columns represents the respective case's variables (or
condition attributes). The HIV Status is the outcome (also called the concept
or decision attribute) of each object. The outcome contains either a 1 or a 0,
indicating whether the particular case is infected with HIV or not.

3.2 Information System 

Once the information table is obtained, the data is discretised into four
partitions as mentioned earlier. An information system can be understood as a
pair $\mathcal{A} = (U, A)$, where $U$ and $A$ are finite, non-empty sets called
the universe and the set of attributes, respectively [11].

For every attribute $a \in A$, we associate a set $V_a$ of its values, where
$V_a$ is called the value set of $a$:

$$a : U \rightarrow V_a \qquad (1)$$

Any subset $B$ of $A$ determines a binary relation $I(B)$ on $U$, which is
called an indiscernibility relation. This concept is explained below.

3.3 Indiscernibility Relation

The main concept of rough set theory is the indiscernibility relation
(indiscernibility meaning indistinguishable from one another). Sets that are
indiscernible are called elementary sets, and these are considered the building
blocks of RST's knowledge of reality. A union of elementary sets is called a
crisp set, while any other sets are referred to as rough or vague.

More formally, for a given information system $\mathcal{A}$ and any subset
$B \subseteq A$, there is an associated equivalence relation $I(B)$, called the
$B$-indiscernibility relation, which is represented as shown in (2) below:

$$(x, y) \in I(B) \iff a(x) = a(y) \ \text{for every } a \in B \qquad (2)$$

RST offers a tool to deal with indiscernibility. For each concept/decision $X$,
the greatest definable set contained in $X$ and the least definable set
containing $X$ are computed. These two sets are called the lower and upper
approximation respectively.
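
The indiscernibility relation can be illustrated with a short Python sketch
that groups objects into elementary sets. The toy universe, the attribute names
and the function name below are hypothetical and serve only as an example; the
values are assumed to be already discretised.

```python
from collections import defaultdict


def indiscernibility_classes(objects, attributes):
    """Group object ids whose values agree on every attribute in `attributes`."""
    classes = defaultdict(set)
    for obj_id, row in objects.items():
        key = tuple(row[a] for a in attributes)
        classes[key].add(obj_id)
    return list(classes.values())


# Toy universe of three objects (hypothetical values).
U = {
    "o1": {"race": 2, "education": 3, "hiv": 1},
    "o2": {"race": 2, "education": 3, "hiv": 0},
    "o3": {"race": 1, "education": 0, "hiv": 0},
}
B = ["race", "education"]
classes = indiscernibility_classes(U, B)
```

Here o1 and o2 agree on every attribute in B and so fall into the same
elementary set, even though their outcomes differ; o3 forms an elementary set
of its own.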

3.4 Lower and Upper Approximations 

The sets of cases/objects with the same outcome variable are assembled
together. This is done by looking at the "purity" of the particular object's
attributes in relation to its outcome. In most cases it is not possible to
define cases into crisp sets; in such instances, lower and upper approximation
sets are defined.

The lower approximation is defined as the collection of cases whose
equivalence classes are fully contained in the set of cases we want to
approximate [10]. The lower approximation of set $X$ is denoted
$\underline{B}X$ and is mathematically represented as:

$$\underline{B}X = \{x \in U : B(x) \subseteq X\} \qquad (3)$$

where $B(x)$ denotes the equivalence class of $x$ under $I(B)$.

The upper approximation is defined as the collection of cases whose
equivalence classes are at least partially contained in the set of cases we
want to approximate [10]. The upper approximation of set $X$ is denoted
$\overline{B}X$ and is mathematically represented as:

$$\overline{B}X = \{x \in U : B(x) \cap X \neq \emptyset\} \qquad (4)$$

It is through these lower and upper approximations that any rough set is
defined. Lower and upper approximations are defined differently in the
literature, but it follows that a crisp set is only defined for
$\underline{B}X = \overline{B}X$.
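
The two approximations can be computed directly from the equivalence classes.
The following Python sketch is one way of doing so under the definitions above;
the toy objects, the target set and the function names are hypothetical.

```python
from collections import defaultdict


def equivalence_classes(objects, attributes):
    """Partition object ids by their values on the given attributes."""
    classes = defaultdict(set)
    for obj_id, row in objects.items():
        classes[tuple(row[a] for a in attributes)].add(obj_id)
    return list(classes.values())


def approximations(objects, attributes, target):
    """Return the (lower, upper) approximations of the set `target` of object ids."""
    lower, upper = set(), set()
    for cls in equivalence_classes(objects, attributes):
        if cls <= target:   # class fully contained in the target set
            lower |= cls
        if cls & target:    # class overlaps the target set
            upper |= cls
    return lower, upper


U = {
    "o1": {"race": 2, "education": 3},
    "o2": {"race": 2, "education": 3},
    "o3": {"race": 1, "education": 0},
    "o4": {"race": 3, "education": 1},
}
X = {"o1", "o3"}  # hypothetical set of HIV positive cases
lower, upper = approximations(U, ["race", "education"], X)
```

In this toy example only o3 can be placed in X with certainty, whereas o1 and
o2 are indiscernible and so both appear in the upper approximation.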

It must be noted that for most cases in RST, reducts are generated to enable
us to discard functionally redundant information [TS]. Although reducts are one
of the main advantages of RST, they are ignored for the purpose of this paper,
i.e. the optimisation of discretised partitions.

3.5 Rough Membership Function

The rough membership function $\mu_X^B : U \rightarrow [0, 1]$, when applied to
object $x$, quantifies the degree of relative overlap between the set $X$ and
the indiscernibility set to which $x$ belongs. It is a measure of the
plausibility with which an object $x$ belongs to set $X$, and is defined as:

$$\mu_X^B(x) = \frac{|[x]_B \cap X|}{|[x]_B|} \qquad (5)$$

where $[x]_B$ denotes the equivalence class of $x$ under the
$B$-indiscernibility relation.
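
A minimal Python sketch of the rough membership function follows. The objects,
the target set and the function name are hypothetical; the sketch simply counts
how much of an object's equivalence class lies inside the target set.

```python
def rough_membership(objects, attributes, target, x):
    """Fraction of x's equivalence class that lies inside the set `target`."""
    key = tuple(objects[x][a] for a in attributes)
    cls = {
        obj_id
        for obj_id, row in objects.items()
        if tuple(row[a] for a in attributes) == key
    }
    return len(cls & target) / len(cls)


U = {
    "o1": {"race": 3, "education": 2},
    "o2": {"race": 3, "education": 2},
    "o3": {"race": 3, "education": 2},
    "o4": {"race": 1, "education": 0},
}
X = {"o1", "o4"}  # hypothetical set of HIV positive cases
mu_o1 = rough_membership(U, ["race", "education"], X, "o1")
mu_o4 = rough_membership(U, ["race", "education"], X, "o4")
```

In this example o1 shares its class with two HIV negative objects, giving a
membership of one third, while o4 is alone in its class and has a membership of
one.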



3.6 Rough Set Accuracy 

The accuracy of rough sets provides a measure of how closely the rough set
approximates the target set. It is defined as the ratio of the number of
objects which can be positively placed in $X$ to the number of objects that can
possibly be placed in $X$. In other words, it is defined as the number of cases
in the lower approximation divided by the number of cases in the upper
approximation, with $0 \leq \alpha_B(X) \leq 1$:

$$\alpha_B(X) = \frac{|\underline{B}X|}{|\overline{B}X|} \qquad (6)$$
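
As a worked check of the accuracy measure, the short Python sketch below
applies it to the case counts reported later in this paper. It assumes that the
upper approximation is counted as the discernible plus the indiscernible cases;
the function name is hypothetical.

```python
def rough_set_accuracy(n_lower, n_upper):
    """Ratio of lower approximation size to upper approximation size."""
    if n_upper == 0:
        raise ValueError("upper approximation must be non-empty")
    return n_lower / n_upper


# Counts reported in the results: discernible and indiscernible cases.
optimised = rough_set_accuracy(329, 329 + 123)   # GA-optimised partitions
equal_width = rough_set_accuracy(130, 130 + 95)  # equal width bins
```

Under this assumption the optimised case evaluates to approximately 0.728,
matching the reported 72.8%, and the equal width case to approximately 0.578,
which is reported as 57.7%.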



3.7 Rough Sets Formulation 

The process of modelling the rough set can be broken down into five stages.

The first stage is to select the data. The data used is obtained from the
South African antenatal survey of 2001 [18].

The second stage involves pre-processing the data to ensure it is ready for
analysis. This stage involves discretising the data and removing unnecessary
data (cleaning the data). Although the optimal selection of set sizes for the
discretisation of attributes will not be known at first, an optimisation
technique (genetic algorithm) is run on the set to ensure the highest degree
of accuracy when forecasting outcomes. This is explained more clearly below
and is illustrated in Figure 1.

If reducts were considered, the third stage would be to use the cleaned data
to generate reducts. A reduct is the most concise way in which we can discern
object classes [19]. In other words, a reduct is the minimal subset of
attributes that enables the same classification of elements of the universe as
the whole set of attributes [16]. To cope with inconsistencies, lower and upper
approximations of decision classes are defined [TTJ [TSJ [T71 [T^.

Stage four is where the rules are extracted or generated. The rules are
normally determined based on condition attribute values [20]. Once the rules
are extracted, they can be presented in an if CONDITION(S)-then DECISION
format [21].



GA's to Optimise Rough Sets 






The final or fifth stage involves testing the newly created rules on a test
set. The accuracy is noted and sent back into the genetic algorithm in stage
two, and the process continues until the optimum or highest accuracy is
achieved.

Pre-processing Data 

As with many surveys, there is missing and/or incorrect data, which needs to be
cleaned before any processing can be performed. The first irregularity is
missing data. This could be because respondents omitted certain information, or
because of errors made when the data was entered onto the computer. Such cases
are removed from the data set. The second irregularity is information that is
false, for instance where gravidity is zero and parity is at least one.
Gravidity is defined as the number of times that a woman has been pregnant, and
parity as the number of times that she has given birth. It is therefore
impossible for a woman to have given birth if she has not been pregnant, and
such cases are removed from the data set. As mentioned earlier, multiple births
are still indicated with a parity of one; therefore, if parity is greater than
gravidity, that particular case is also removed from the data set. Only 12945
cases remained from a total of 13087.
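
The two cleaning rules described above can be sketched in Python as follows.
The field names and the use of None to mark a missing value are assumptions
made for the example; the format of the survey file is not described in this
paper.

```python
# Hypothetical field names for the six condition attributes and the decision.
FIELDS = ("race", "mothers_age", "education", "gravidity",
          "parity", "fathers_age", "hiv")


def is_valid(record):
    """Keep a record only if it is complete and parity does not exceed gravidity."""
    if any(record.get(field) is None for field in FIELDS):
        return False  # missing data
    if record["parity"] > record["gravidity"]:
        return False  # more births than pregnancies is impossible
    return True


def clean(records):
    """Return only the records that pass both cleaning rules."""
    return [r for r in records if is_valid(r)]


records = [
    {"race": 2, "mothers_age": 32, "education": 13, "gravidity": 1,
     "parity": 1, "fathers_age": 22, "hiv": 1},
    {"race": 3, "mothers_age": 22, "education": 5, "gravidity": 0,
     "parity": 1, "fathers_age": 25, "hiv": 1},     # parity > gravidity
    {"race": 1, "mothers_age": 35, "education": None, "gravidity": 1,
     "parity": 0, "fathers_age": 33, "hiv": 0},     # missing education
]
cleaned = clean(records)
```

The single test on parity exceeding gravidity also covers the case of zero
gravidity with a parity of at least one.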

Rule Extraction 

Once RST was applied to the HIV data, 329 unique distinguishable cases and 123
indiscernible cases were extracted. The data set of 12945 cases therefore
represents only 452 of the 4096 possible unique combinations (six attributes
with four partitions each, i.e. 4^6). From (6), the accuracy of the rough set
is calculated to be 72.8%. The 329 cases of the lower approximation are rules
that always hold, or are definite cases. The 123 cases of the upper
approximation can only be stated with a certain plausibility. Examples of both
cases are stated below:

Lower Approximation Rules 

1. If Race = African and Mothers Age = 23 and Education = 4 and Gravidity = 2
and Parity = 1 and Fathers Age = 20 Then HIV = Most Probably Positive

2. If Race = Asian and Mothers Age = 30 and Education = 13 and Gravidity = 1
and Parity = 1 and Fathers Age = 33 Then HIV = Most Probably Negative

Upper Approximation Rules 

1. If Race = Coloured and Mothers Age = 33 and Education = 7 and Gravidity = 1
and Parity = 1 and Fathers Age = 30 Then HIV = Positive with
plausibility = 0.33333

2. If Race = White and Mothers Age = 20 and Education = 5 and Gravidity = 2
and Parity = 1 and Fathers Age = 20 Then HIV = Positive with
plausibility = 0.06666
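
Rules of both kinds can be generated by grouping cases on their condition
attribute values. The Python sketch below is a simplified illustration under
that assumption and is not necessarily the procedure used to obtain the rules
above; the function name, field names and toy rows are hypothetical.

```python
from collections import defaultdict


def extract_rules(rows, condition_attrs, decision="hiv"):
    """Split grouped cases into certain rules and rules with a plausibility.

    certain:  list of (conditions, decision value) where all cases agree
    possible: list of (conditions, fraction of cases with decision == 1)
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in condition_attrs)].append(row[decision])

    certain, possible = [], []
    for key, outcomes in groups.items():
        conditions = dict(zip(condition_attrs, key))
        if len(set(outcomes)) == 1:
            certain.append((conditions, outcomes[0]))
        else:
            possible.append((conditions, sum(outcomes) / len(outcomes)))
    return certain, possible


rows = [
    {"race": 3, "education": 7, "hiv": 1},
    {"race": 3, "education": 7, "hiv": 0},
    {"race": 3, "education": 7, "hiv": 0},
    {"race": 4, "education": 13, "hiv": 0},
]
certain, possible = extract_rules(rows, ["race", "education"])
```

In this toy example the three cases that share the same condition values give a
rule for HIV positive with plausibility one third, while the remaining case
gives a certain rule for HIV negative.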



Bodie Crossingham and Tshilidzi Marwala

4 Genetic Algorithm 

A genetic algorithm (GA) is a stochastic search procedure for combinatorial
optimisation problems based on the mechanism of natural selection [22]. Genetic
algorithms are a particular class of evolutionary algorithms that use
techniques inspired by evolutionary biology such as inheritance, mutation,
selection, and crossover. The fitness/evaluation function is the only part of
the GA that has any knowledge about the problem. Here, the fitness function
tries to maximise the accuracy of the rough set. Figure 1 illustrates the
process of computing the rough sets while the GA simultaneously optimises the
partition sizes.



[Figure: block diagram. Stage 1: Obtain HIV data set -> Stage 2: Pre-process
data (clean and discretise data according to genetic algorithm) -> Stage 3:
Compute lower and upper approximations -> Stage 4: Generate rule set ->
Stage 5: Test rule validity on test data -> Genetic algorithm: optimises
discretised partitions.]

Fig. 1: Block Diagram of the Sequence of Events in Modelling MID



The pseudo-code algorithm for genetic algorithms is given below:

1. Initialise a population of chromosomes 

2. Evaluate each chromosome (individual) in the population 

a) Create new chromosomes by mating chromosomes in the current pop- 
ulation (using crossover and mutation) 

b) Delete members of the existing population to make way for the new 
members 

c) Evaluate the new members and insert them into the population 

3. Repeat step 2 until some termination condition is reached; in this case,
until 100 generations have been completed.

4. Return the best chromosome as the solution 

As the selection, mutation and crossover functions are specific to each
problem, the best results for the purposes of this paper were obtained using
normal geometric selection, uniform mutation and cyclic crossover. An initial
population of 20 individuals was chosen. GAs may also prematurely converge to a
local optimum, but they incorporate a diversification mechanism to avoid this;
the mechanism used here is mutation.
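
To make the optimisation loop concrete, a minimal Python sketch is given here.
It is not the implementation used in this paper: truncation selection and
uniform crossover stand in for normal geometric selection and cyclic crossover,
the fitness is assumed to be the fraction of distinct discretised patterns
whose cases all share one outcome, and the toy data, bounds and all names are
hypothetical. Only the population size of 20 and the 100 generations are taken
from the text.

```python
import random

CUTS_PER_ATTR = 3  # four partitions per attribute need three cut points


def discretise(value, cuts):
    """Partition index of a value given sorted cut points."""
    return sum(value >= c for c in cuts)


def fitness(chromosome, rows, n_attrs):
    """Fraction of distinct discretised patterns whose cases share one outcome."""
    cuts = [sorted(chromosome[CUTS_PER_ATTR * j: CUTS_PER_ATTR * (j + 1)])
            for j in range(n_attrs)]
    outcomes = {}
    for *values, decision in rows:
        key = tuple(discretise(v, c) for v, c in zip(values, cuts))
        outcomes.setdefault(key, set()).add(decision)
    return sum(len(s) == 1 for s in outcomes.values()) / len(outcomes)


def run_ga(rows, bounds, pop_size=20, generations=100, p_mut=0.1, seed=0):
    """Search for cut points that maximise the fitness defined above."""
    rng = random.Random(seed)
    n_attrs = len(bounds)

    def random_gene(attr):
        lo, hi = bounds[attr]
        return rng.uniform(lo, hi)

    def score(chromosome):
        return fitness(chromosome, rows, n_attrs)

    population = [[random_gene(j) for j in range(n_attrs)
                   for _ in range(CUTS_PER_ATTR)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]  # truncation selection (stand-in)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            for g in range(len(child)):                       # uniform mutation
                if rng.random() < p_mut:
                    child[g] = random_gene(g // CUTS_PER_ATTR)
            children.append(child)
        population = parents + children  # new members replace the weaker half
    return max(population, key=score)


# Toy data: (mother's age, education, HIV status); purely illustrative.
rows = [(age, edu, int(age >= 35)) for age in range(20, 50) for edu in (0, 6, 13)]
bounds = [(20, 50), (0, 13)]
best = run_ga(rows, bounds)
best_fitness = fitness(best, rows, len(bounds))
```

Because the better half of each generation is carried over unchanged, the best
fitness in this sketch never decreases between generations; this is a
simplification and differs from the selection scheme named above.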



5 Results Obtained 

The accuracy of the rough set was calculated for two cases. In the first case,
the partitions were discretised manually into equal width bins; in the second,
the partition sizes were chosen optimally by implementing a GA. The first case
yielded 225 cases, of which 130 were unique discernible cases and 95 were
indiscernible. This represents an accuracy of 57.7%. The second case yielded
452 cases, of which 329 were discernible and 123 were indiscernible. This
produced an accuracy of 72.8%. The results are clearly better for the optimised
case. As a result of implementing RST on the data set, the rules extracted are
explicit and easily interpreted. RST does, however, trade accuracy for rule
interpretability; this is brought about in the discretisation process, where
the granularity of the variables is decreased.



6 Conclusion 

A genetic algorithm was successfully applied to RST on the HIV data set.
Although RST does not produce accuracies as high as those of previous
computational intelligence methods, it does produce explicit and
easy-to-interpret rules. An accuracy of 72.8% was produced by the rough set
when applied to the HIV data set. The GA optimisation method produced good
results, but GAs may prematurely converge towards local optima. Recommendations
for future work include the application of other optimisation techniques such
as particle swarm optimisation (PSO). PSO is advantageous over GAs as it is
easy to implement and there are fewer parameters to adjust. Different
divergence mechanisms such as elitism can also be considered for a possible
increase in accuracy.